Abstract: Ordinal regression is commonly formulated as a
multi-class problem with ordinal constraints. The challenge of
designing accurate classifiers for ordinal regression generally
increases with the number of classes involved, due to the large
number of labeled patterns that are needed. Ordinal class
labels, however, are often costly to calibrate or
difficult to obtain. Unlabeled patterns, on the other hand, often
exist in much greater abundance and are freely available. To
benefit from the abundance of unlabeled patterns, we present
a novel transductive learning paradigm for ordinal regression in
this paper, namely Transductive Ordinal Regression (TOR). The
key challenge of the present study lies in the precise estimation of
both the ordinal class labels of the unlabeled data and the decision
functions of the ordinal classes, simultaneously. The core elements
of the proposed TOR include an objective function that caters
to several commonly used loss functions cast in transductive
settings, for general ordinal regression. A label swapping scheme
that facilitates a strictly monotonic decrease in the objective
function value is also introduced. Extensive numerical studies
on commonly used benchmark datasets, including the real-world
sentiment prediction problem, are then presented to showcase the
characteristics and efficacy of the proposed transductive ordinal
regression. Further, comparisons to recent state-of-the-art ordinal
regression methods demonstrate that the introduced transductive
learning paradigm for ordinal regression leads to robust and
improved performance.

Index Terms: Transductive Learning; Ordinal Regression;
Ordinal Classification; Ordinal Loss Function; Support Vector
Machines; Cluster Assumption

I. Introduction 

Ordinal regression (OR) is generally defined as the task
where input sample vectors are ranked on an ordinal
scale [?], [?], [?]. In a five-star movie rating, for instance, the
higher the rating, the better a movie is perceived to be. This
rating can be configured as ordinal class labels {1,2,3,4,5},
which represent the number of stars a particular movie can
be awarded. Hence the class labels are imbued with ordered
information, i.e., a sample vector associated with class label 2
has a higher rating (or is better) than another having class label 1,
a sample with class label 3 is rated higher than those with class
labels 1 and 2, and so on. In the literature, ordinal regression is
sometimes referred to interchangeably as ordinal classification
or as multi-class classification [?], [?], [?] with ordered
classes. Today, ordinal regression of movie ratings, such as the
prediction of movie sentiment ratings, represents an important
task for sales personnel as part of their marketing strategy.
Besides sentiment prediction, ordinal regression is also used
in a wide range of applications, from information
retrieval [?], [?], collaborative filtering [?], medical analysis
[?], and gene expression analysis [?], to employee selection and
prediction of pasture production [?].
Computer Engineering, Nanyang Technological University, Singapore 639798,
Initial efforts pertaining to the use of support vector (SV)
learning in ordinal regression were reported by Herbrich et
al. [?]. Their work is based on a threshold model, as shown
in Fig. 1, in which the threshold values of each ordinal
class are estimated. Then, Shashua and Levin [?] introduced
two approaches for ordinal regression using the large margin
principle. The first approach maximizes the margin between
adjacent classes, whereas the other maximizes the sum of
$K - 1$ margins, with $K$ denoting the number of classes.

Both explicit and implicit constraints on the order of the
thresholds in the model formulation, referred to as SVOR-EXC
and SVOR-IMC in [?], [?], have also been considered recently.
Li and Lin [?] extended their work with a framework that
transforms the problem of ordinal regression to an extended
binary classification, as a generalization of both SVOR-EXC
and SVOR-IMC. By deriving the thresholds directly from
the support vectors, a more efficient alternative, namely the
Reduction Support Vector Machine, was introduced. Last but
not least, as opposed to using all $n$ data points, Zhao et
al. [?] considered $k$ cluster representatives as the training
data in SVOR-EXC, leading to a significant reduction in the
computational complexity, especially for large-scale datasets,
since $k \ll n$.

To summarize, the field of ordinal regression has evolved
in the last decade, with a plethora of noteworthy research
progress made in supervised learning [?], [?], [?], [?], [?],
[?], [?], [?], [?], [?]. In spite of the extensive work on this
topic, existing methodologies proposed for ordinal regression
may be fundamentally bounded by the lack of sufficient class
labels found in the data. In particular, it is worth noting that
ordinal class labels are often difficult to obtain. Specific tasks
such as gene expression [?] and cell-phenotype images [?] are
generally costly to annotate and calibrate due to the need for
biological experts. Further, in many realistic applications of
science and engineering, it may happen that deriving the labels
involves hazardous experiments or the assessment of the label
involves extreme conditions in resources [?]. A well known
example is the movie sentiment problem where ordinal labels
of movie ratings are scarce. Moreover, learning all the ordinal
boundaries (between pairs of consecutive classes) generally
requires a considerable amount of labeled data due to the large
number of unique class labels involved. Unlabeled data, on
the other hand, exist in much greater abundance and are
often freely available at zero cost. To benefit from the
abundance of unlabeled patterns, the objective of the present
paper is to introduce a novel transductive learning paradigm
for ordinal regression, referred to here as Transductive Ordinal
Regression, or TOR in short.

TABLE I

A summary of ordinal regression and related algorithms

The key challenge of TOR design lies in the appropriate
incorporation of unlabeled data within the multi-class
classification problem formulation with ordinal constraints.
This involves the tasks of estimating the ordinal class labels
of the unlabeled data and the decision function of multiple
ordinal classes simultaneously. In TOR, we consider both
$p(x)$ and $p(y|x)$. In particular, using the $p(x)$ of both labeled
and unlabeled data, we avoid decision boundaries that lie
in high-density regions (i.e., where $p(x)$ is high) [?] by means of
the cluster assumption [?], [?]. In addition, the extension of classical
OR to a transductive OR paradigm is also non-trivial. To
be more precise, current transductive approaches are not
designed to function well on ordinal regression (multi-class^1
with ordering information) problems. Taking this cue, we
present in this paper a novel transductive learning paradigm
for ordinal regression [?], [?]. In particular, we formulate
the ordinal-class problem as an extended binary classification
problem, such that the ordinal constraints can be implicitly
enforced. Subsequently, a proposed label swapping scheme
for multiple class transduction is introduced to derive ordinal
decision boundaries that pass through low-density regions of
the augmented labeled and unlabeled data.

A summary of some existing state-of-the-art ordinal regression
approaches and the TSVM is outlined in Table I, where
the major similarities and differences are explicitly identified
with respect to 'the type of decision boundaries', 'the number
of classifiers to train for K ordinal classes' and 'whether or
not cluster assumption and ordinal constraints are imposed'.
Notably, TSVM requires $K$ classifiers in order to learn the
labels of unlabeled data for all $K$ classes at the same time.
As such, the training process of TSVM is much more time
consuming and complex compared to ORs or TOR, since the
latter approaches only require a single classifier to be trained.
Further, the prediction process of TSVM involves $K$ classifiers
and does not take the ordinal constraints into consideration.

^1 For multi-class problems without ordering information, readers are
referred to [?], [?], [?].


With only a single classifier, the training process of ORs and 
TOR is clearly more efficient. 

For the sake of brevity, the core contributions of the present 
study are outlined as follows: 

1) A transductive learning paradigm of ordinal regression
involving labeled and unlabeled data for learning ordinal
decision functions is introduced. To the best of our
knowledge, the present work serves as the first attempt
that addresses the general ordinal regression problem in
a transductive setting for a family of commonly used loss
functions including hinge loss, logistic loss, Laplacian
loss and others listed in Table II.

2) A label swapping scheme for multiple ordinal class
transduction is introduced. The proof of strictly monotonic
decrease in the objective function is also derived
for the swapping scheme. The proposed transductive
ordinal regression algorithm is thus established.

3) Numerical studies show that TOR achieves significant
accuracy improvements in terms of mean zero-one
and absolute errors when pitted against other
state-of-the-art algorithms for ordinal regression and transductive
support vector machines.

The rest of this paper is organized as follows: A brief
introduction of ordinal regression is provided in Section
II. Section III introduces the transductive ordinal regression
(TOR) algorithm. Subsection III-A details the initialization
of the pseudo-labels for unlabeled data, while the ordinal
loss function used in transductive learning by means of label
swapping to minimize the structural risk is described in
subsection III-B. The parameters that control the importance
of the labeled and unlabeled data used in the loss function are
then discussed in subsection III-C. Section IV generalizes a
family of well established binary functions as potential loss
functions in TOR. An instantiation of TOR with hinge loss is
also presented in the section. Extensive experimental results
on four benchmark datasets and the real-world sentiment
prediction problem are reported in Section V. Analysis and
discussions pertaining to the experimental results are then
provided in Section VI, while the brief conclusions of the
present work are drawn in Section VII.
II. Review of Ordinal Regression

A. Notation

Throughout the rest of this paper, a superscript $T$ denotes
the transpose of a vector or a matrix. We are given $n$ labeled
samples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ in the data set, where
$x_i \in \mathbb{R}^p$ represents the $i$th sample with ordinal class label
$y_i \in \{1, 2, \ldots, K\}$. Consider also a threshold model such as
that depicted in Fig. 1, where a $K$ ordinal class problem
has $K - 1$ ordered thresholds: $\theta_1 < \theta_2 < \ldots < \theta_{K-1}$.
Thus, a sample, $x$, is classified as Class $i$ when the predictive
output $h(x) = w^T x$ falls in the range $\theta_{i-1} < h(x) < \theta_i$,
where $w \in \mathbb{R}^p$, and $\theta_0 = -\infty$ and $\theta_K = \infty$ are typically
assumed. For example, a Class 2 label implies an output that
lies between $\theta_1$ and $\theta_2$.

B. Ordinal Regression as an Extended Binary Classification 
Model 

Ordinal regression using a threshold model generally considers
the extended binary classification problem [?] of the
form:

$$x_i^k = (x_i, e_k) \in \mathbb{R}^{p+K-1}, \qquad y_i^k = 1 - 2 I[y_i \le k], \tag{1}$$

for $k = 1, 2, \ldots, K - 1$. Here $e_k \in \mathbb{R}^{K-1}$ denotes a vector
with the $k$th element as value 1 and the rest of the elements
having value zero, and $I[\cdot]$ denotes an indicator function that
returns 1 if the predicate holds, otherwise a zero is returned.
Essentially, each labeled sample $x_i$ in the original data set is
duplicated $K - 1$ times, and the $k$th copy is augmented with
$e_k$ and is assigned a binary label $y_i^k$ in the transformed
problem.

A binary classifier with a weight vector

$$\tilde{w} = (w, -\theta) \in \mathbb{R}^{p+K-1} \tag{2}$$

is then learned to predict $y_i^k$ such that $(w, -\theta)^T x_i^k =
w^T x_i - \theta_k$. Hence, the threshold $\theta_k$ of the threshold model
is estimated using feature augmentation. Subsequently, the
predictive ordinal class label of each sample, $x_i$, is computed
as:

$$f(x_i) = 1 + \sum_{k=1}^{K-1} I[g(x_i^k) > 0] \tag{3}$$

where $g(x_i^k) = \tilde{w}^T x_i^k = (w, -\theta)^T x_i^k = w^T x_i - \theta_k =
h(x_i) - \theta_k$ and $I[\cdot]$ is an indicator function that returns 1
if the predicate holds, otherwise 0 is returned.

In this manner, besides inheriting the theoretical rigor of
binary classifiers, typical caching and optimization techniques
such as Sequential Minimal Optimization (SMO) [?], [?] can
also be used in ordinal regression.
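
The transformation in (1) and the prediction rule in (3) can be sketched in a few lines of Python. This is only an illustrative sketch, assuming NumPy and a linear model; the function names `extend_binary` and `predict_ordinal` are hypothetical and do not come from the paper or from any library.

```python
import numpy as np

def extend_binary(X, y, K):
    """Eq. (1): copy each sample K-1 times, append e_k, label 1 - 2*I[y_i <= k]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    E = np.eye(K - 1)  # row k-1 is the unit vector e_k
    # Copies are grouped by threshold index: all samples for k=1, then k=2, ...
    X_ext = np.vstack([np.hstack([X, np.tile(E[k], (n, 1))]) for k in range(K - 1)])
    # The loop index is 0-based, so threshold k of the text is k + 1 here.
    y_ext = np.concatenate([1 - 2 * (y <= k + 1).astype(int) for k in range(K - 1)])
    return X_ext, y_ext

def predict_ordinal(X, w, theta):
    """Eq. (3): f(x) = 1 + sum_k I[w^T x - theta_k > 0] for a linear h(x)."""
    h = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return 1 + (h[:, None] - theta[None, :] > 0).sum(axis=1)
```

For a sample with $y = 3$ and $K = 5$, `extend_binary` yields the binary labels 1, 1, -1, -1 for $k = 1, \ldots, 4$, so the predicted class is one plus the number of thresholds that the output exceeds.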

III. Transductive Ordinal Regression 

In this section, we present the essential components of the
proposed TOR algorithm for ordinal regression. In particular,
we consider the ordinal regression problem where $n$ labeled
samples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and $u$ unlabeled
samples $x_{n+1}, x_{n+2}, \ldots, x_{n+u}$ are available. In what follows, we
introduce a novel transductive learning paradigm, referred to
here as Transductive Ordinal Regression (TOR), for inferring
the labels $y^* = \{y_{n+1}, y_{n+2}, \ldots, y_{n+u}\}$ of the $u$
unlabeled data instances and modeling the prediction function,
$h(x)$, by minimizing the structural risk functional of the form:

$$\min \; \Upsilon(h, \theta) + C_1 \sum_{i=1}^{n} \ell_{y_i}(h(x_i), \theta) + C_2 \sum_{j=n+1}^{n+u} \ell_{y_j}(h(x_j), \theta)$$
$$\text{s.t.} \quad \theta_k < \theta_{k+1}, \; \forall k \in \{1, \ldots, K-2\} \tag{4}$$

where $\Upsilon$ is the regularizer that controls the complexity of
$h$ and $\theta$, and $C_1$ and $C_2$ are the parameters that trade off
the amount of regularization against the loss function $\ell_{y_i}(\cdot)$
on the labeled data and unlabeled data, respectively. Recall
that ordinal regression involves a $K$ class problem, hence the
loss function in (4) can be represented by $K$ loss functions,
where each loss function represents a class, as depicted in Fig. 2.
In other words, each sample, $x_i$, with a class label, $y_i$,
possesses a loss function represented by $\ell_{y_i}(h(x_i), \theta)$.

Fig. 2. Loss function for each class in K ordinal class problem
(panels shown include the Class 2 loss function and the Class K-1
loss function)

Through (4), TOR simultaneously learns the order of the
decision boundaries, $\theta$, and the pseudo-labels
of the unlabeled data, with the decision boundaries enforced
to fall on low-density regions of both labeled and unlabeled
data, while satisfying the cluster assumption. In this manner,
the majority of the data vectors in the $k$th ordinal class would lie
in the range of thresholds $\theta_{k-1}$ and $\theta_k$, while the loss function
$\ell_{y_i}(\cdot)$ then caters to the remaining data (a.k.a. the outliers)
that violate the cluster assumption.

Solving (4) optimally would involve trying out all the
possible combinations of assignment for $y^*$, resulting in an
NP-hard problem. Hereafter, (4) is solved by first finding $h$
and $\theta$ while fixing $y^*$, then applying the swapping scheme to
update $y^*$ and repeating the entire process until convergence
is reached, as outlined in Algorithm 1.

Algorithm 1 Transductive Ordinal Regression (TOR)

 1: Parameters: $C_1$
 2: Inputs: a training set including labeled and unlabeled
    samples $D_L = (x_1, y_1), \ldots, (x_n, y_n)$ and $D_U = x_{n+1}, \ldots, x_{n+u}$
 3: Outputs: predicted labels of $D_U$
    // Initialization of unlabeled data's class label
 4: assign $y^*$ using Algorithm 2
    // transductive learning
 5: set $C_2$ = some small value (e.g. 10 )
 6: while $C_2 < C_1$ do
 7:   repeat
 8:     $(w, \theta)$ := solve (4) by fixing $y^*$
 9:     for int k = 1; k < K; k++ do
10:       if $\exists (i, j)$ satisfying (5) then
11:         if there is more than one, choose the one
            with the largest decrease in the loss value
12:         $y_i = k + 1$
13:         $y_j = k$
14:       end if
15:     end for
16:   until no label is swapped
17:   $C_2 = C_2 * 2$
18: end while
19: return $y^*$

Algorithm 2 Initialization of pseudo-labels for Unlabeled Data

 1: Parameter: $C_1$
 2: Inputs: a training set including labeled and unlabeled
    samples $D_L = (x_1, y_1), \ldots, (x_n, y_n)$ and $D_U = x_{n+1}, \ldots, x_{n+u}$
 3: Outputs: $y^*$ of $D_U$
    // Start of algorithm
 4: Count the number of samples $num_k$ in $D_L$ that fall into
    Class $k$ and then compute $ratio_k = num_k / n$
 5: $(w, \theta)$ := solve (4) with $C_2 = 0$ (i.e. without $D_U$)
 6: Compute the predicted value, $w^T x_i$, of $\forall x_i \in D_U$
 7: Sort $D_U$ in ascending order of the predicted value to form
    a sorted $D'_U$
 8: for int k = 1; k < K; k++ do
 9:   assign the first $ratio_k$ of unassigned samples in $D'_U$
      with label $k$
10: end for
11: assign the rest of unassigned samples in $D'_U$ as $K$
12: return $y^*$



A. Pseudo-labels of Unlabeled Data Initialization 

The initialization phase of TOR focuses on assigning
initial pseudo-labels to the unlabeled data. By using a large
margin criterion, the optimization problem may lead to trivial
solutions, e.g., all unlabeled data are classified with positive
labels [?], [?]. The common practice in transductive learning
is to impose some class ratio constraints on the eventual
labels of the unlabeled data (e.g., assuming a balanced class
distribution), where such an assumption has been shown
to mitigate the issue of unbalanced output distribution and
improve prediction performance [?]. Taking this cue, in
TOR, the pseudo-labels of the unlabeled data are constrained
to match the class distribution of the labeled data. In particular,
the constraints are fulfilled implicitly through the procedure of
first training a supervised OR classifier on the available labeled
data and subsequently sorting the unlabeled data according to
the values predicted by the trained supervised OR classifier.
Pseudo-labels are then assigned to the sorted set with respect
to the class distribution of the labeled data. The procedure
to initialize the pseudo-labels of unlabeled data is outlined in
Algorithm 2.
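
The sort-and-assign step of this initialization can be sketched in Python as follows. The sketch assumes the supervised OR classifier has already produced the predicted values for the unlabeled data, and it assumes that each class ratio is converted into a count by rounding `ratio_k * u`, a detail the text does not specify. The name `init_pseudo_labels` is hypothetical.

```python
import numpy as np

def init_pseudo_labels(scores_unlabeled, y_labeled, K):
    """Assign pseudo-labels whose class proportions follow the labeled data."""
    scores = np.asarray(scores_unlabeled, dtype=float)
    y_labeled = np.asarray(y_labeled)
    u = len(scores)
    # ratio_k = num_k / n for classes 1..K, estimated from the labeled set
    ratios = np.array([(y_labeled == k).mean() for k in range(1, K + 1)])
    # Assumption: ratio_k becomes a sample count by rounding ratio_k * u.
    counts = np.round(ratios[:-1] * u).astype(int)
    order = np.argsort(scores, kind="stable")  # ascending predicted value
    y_star = np.full(u, K)                     # remaining samples get class K
    start = 0
    for k, c in enumerate(counts, start=1):
        y_star[order[start:start + c]] = k
        start += c
    return y_star
```

With labeled classes [1, 1, 2, 3] and four unlabeled scores, the two lowest-scoring samples receive label 1, the next receives label 2, and the highest receives label 3.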




Fig. 3. Two consecutive class loss functions

B. Transductive Learning by Label Swapping 

After the initialization phase to define the structural risk
functional of (4), the minimization of (4) proceeds with a
two-step label swapping procedure. The first step involves fixing $y^*$
to solve for $h$ and $\theta$. Next, both the derived $h$ and $\theta$ are in turn
fixed to locate a suitable $y^*$ that minimizes objective (4). In what
follows, we define the criterion of the ordinal loss function to
arrive at a solution $y^*$ that minimizes objective (4).

Definition 1. Loss function $\ell_{y_i}(\cdot)$ is defined with the following
properties:

1) $\forall i, j$ with $y_i = y_j - 1$, $h(x_i) = h(x_j)$ and $f(x_j) < y_j$: $\ell_{y_i}(h(x_i), \theta) < \ell_{y_j}(h(x_j), \theta)$

2) $\forall i, j$ with $y_i = y_j - 1$, $h(x_i) = h(x_j)$ and $f(x_i) > y_i$: $\ell_{y_i}(h(x_i), \theta) > \ell_{y_j}(h(x_j), \theta)$

Def. 1 defines the relationship between two consecutive
classes. Referring to Fig. 2, a class $k$ loss function is penalized
in both directions. For example, the figure depicts a class 2
loss function consisting of a left and a right slanted line. In
addition, the relationship between the left section (line) of two
consecutive classes is depicted in Fig. 3 (which is a close-up
version of Fig. 2) and satisfies the first property of Def. 1.
In particular, for two adjacent class loss functions evaluated at
the same predictive value, $h$, the lower class loss function
exhibits the smaller loss value. In the same manner, the second
property of Def. 1 defines the right section of the loss function.

Using the loss function governed by Def. 1, in what follows
we present the details on minimizing the structural risk
functional using the proposed label swapping scheme to reduce the
loss term in (4). In order to minimize the objective of transductive
ordinal regression in (4), the following proposition, which
extends Theorem 2 in [?] from binary class problems to $K$
ordinal class problems, is introduced to cater for the ordinal
constraints defined on the unlabeled data.

Proposition 2. For an ordinal loss function defined in Def.
1, swapping the labels of two samples $x_i$ and $x_j$ from two
adjacent classes $y_i$ and $y_j$, i.e., $y_i = y_j - 1$, (4) observes a
strictly monotonic decrease when $f(x_i) > y_i$ and $f(x_j) < y_j$.

Proof: According to Def. 1, the first property assures
$\ell_{y_j - 1}(h(x_j), \theta) < \ell_{y_j}(h(x_j), \theta)$ and the second
property assures $\ell_{y_i + 1}(h(x_i), \theta) < \ell_{y_i}(h(x_i), \theta)$.
Hence, $\ell_{y_i + 1}(h(x_i), \theta) + \ell_{y_j - 1}(h(x_j), \theta) < \ell_{y_i}(h(x_i), \theta) +
\ell_{y_j}(h(x_j), \theta)$ holds. Through swapping, the last term in (4)
will follow a strictly monotonic decrease for fixed $h$ and $\theta$.
After the swapping, a new decision function $h'$ and $\theta'$ will be
learned for (4). Since (4) is a minimization problem, we have:

$$\Upsilon(h', \theta') + C_1 \sum_{i=1}^{n} \ell_{y_i}(h'(x_i), \theta') + C_2 \sum_{j=n+1}^{n+u} \ell_{y_j}(h'(x_j), \theta')$$
$$< \Upsilon(h, \theta) + C_1 \sum_{i=1}^{n} \ell_{y_i}(h(x_i), \theta) + C_2 \sum_{j=n+1}^{n+u} \ell_{y_j}(h(x_j), \theta). \qquad \blacksquare$$

Motivated by Proposition 2 and in the spirit of [?], we
propose the swapping of labels between two consecutive
classes (i.e., Class $k$ and Class $k + 1$) on unlabeled data for a
predictive function $h$ and threshold values $\theta$, when the following
conditions have been met:

$$n + 1 \le i, j \le n + u, \quad y_i = k, \quad y_j = k + 1, \quad f(x_i) > y_i, \quad f(x_j) < y_j. \tag{5}$$

This ensures that (4) strictly decreases upon each swap.

When more than one pair $(i, j)$ satisfying the conditions
in (5) exists, the pair contributing to the largest decrease in
the loss value is selected. Intuitively, this can be viewed as
choosing the pair with the highest information gain through the
strategy.^2
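
One pass of this selection rule for a given pair of adjacent classes can be sketched as follows. The sketch assumes the hinge-based ordinal loss with the bias term omitted, fixed outputs `h` and thresholds `theta` for the unlabeled samples only, and a brute-force search over candidate pairs; `ordinal_hinge` and `best_swap` are hypothetical names, not the authors' implementation.

```python
import numpy as np

def ordinal_hinge(h, y, theta):
    """Sum of K-1 hinge losses for one output h and class y (bias omitted)."""
    k = np.arange(1, len(theta) + 1)
    yk = np.where(k < y, 1.0, -1.0)  # extended binary labels: +1 for k < y
    return np.maximum(0.0, 1.0 - yk * (h - theta)).sum()

def best_swap(h, y_star, theta, k, loss=ordinal_hinge):
    """Return (i, j, decrease) for the best class-k / class-(k+1) swap, or None."""
    h = np.asarray(h, dtype=float)
    y_star = np.asarray(y_star)
    theta = np.asarray(theta, dtype=float)
    f = 1 + (h[:, None] > theta[None, :]).sum(axis=1)       # predicted classes
    cand_i = np.where((y_star == k) & (f > y_star))[0]      # y_i = k,   f(x_i) > y_i
    cand_j = np.where((y_star == k + 1) & (f < y_star))[0]  # y_j = k+1, f(x_j) < y_j
    best = None
    for i in cand_i:
        for j in cand_j:
            before = loss(h[i], k, theta) + loss(h[j], k + 1, theta)
            after = loss(h[i], k + 1, theta) + loss(h[j], k, theta)
            if best is None or before - after > best[2]:
                best = (int(i), int(j), float(before - after))
    return best
```

When `best_swap` returns a pair, the caller would set `y_star[i] = k + 1` and `y_star[j] = k`; a return value of `None` means no pair of samples satisfies the swapping conditions for that `k`.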



C. Control Parameters 

$C_1$ and $C_2$ denote the control parameters of the proposed
TOR detailed in Algorithm 1. In particular, $C_1$ regulates
the tradeoff between misclassification errors on the labeled
samples and the model complexity. In the same way, $C_2$
regulates the tradeoff for the unlabeled samples. $C_1$ denotes
a user-specified parameter whereas $C_2$ is heuristically derived
in TOR. Typically, $C_2$ is initialized with some small value and
gradually increased to approach $C_1$, in the spirit of [?]. This is
a common heuristic strategy used to reduce the possibility of
premature convergence and of getting stuck in a poor approximate
solution when assigning the labels of the unlabeled data.
Note that, when $C_2$ tends to zero, the algorithm becomes
a typical supervised learning problem. Therefore, increasing
$C_2$ gradually transforms the problem of ordinal regression to
TOR. When the stopping criterion pertaining to $C_2$ is reached
in TOR, the assigned ordinal class labels for the unlabeled
data are deemed to have converged. Hence, Algorithm 1 serves as
a form of heuristic local search for solving (4) by means of
approximation.
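
The annealing of $C_2$ can be sketched as a small Python generator. The doubling rule and the stopping test follow Algorithm 1; the starting value is left as an argument because it is a user choice, and the name `c2_schedule` is hypothetical.

```python
def c2_schedule(c1, c2_init):
    """Yield the C2 values visited by the outer loop: double while C2 < C1."""
    c2 = c2_init
    while c2 < c1:
        yield c2
        c2 *= 2
```

For example, `list(c2_schedule(1.0, 0.125))` gives `[0.125, 0.25, 0.5]`, so the number of outer iterations grows only logarithmically with the ratio between $C_1$ and the initial $C_2$.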

^2 Note that the training time of this algorithm can be improved by swapping
the labels from a set of unique pairs [?], since Proposition 2 guarantees that the
objective value in (4) decreases. The study of improving the training time by
swapping more than one pair for binary class problems has been shown in [?].
However, premature convergence might result in poor solutions. Hence, there
is a tradeoff between the convergence of the training process and the quality
of the solution when swapping more than one pair. For simplicity, swapping
only a pair of labels for each adjacent class is considered in the present study.



TABLE II

A family of binary loss functions that can be used in our framework

Function             Formulation of loss $\ell_{y^k}(a)$
Hinge Loss           $\max\{0, 1 - y^k a\}$
Square Hinge Loss    $(\max\{0, 1 - y^k a\})^2$
Logistic Loss        $\log(1 + e^{-y^k a})$
Square Loss
Laplacian Loss


IV. Generalizing the Family of Binary Loss Functions in TOR

In this section, we generalize a family of existing binary
functions for use as potential loss functions in TOR. In particular,
subsection IV-A defines how $K - 1$ binary functions can
be used as the loss function in TOR. Then, an instantiation of
TOR with hinge loss is subsequently showcased in subsection
IV-B. Next, label swapping of TOR for $K$ ordinal class problems is
discussed in subsection IV-C.



A. Superimposing extended binary functions as the loss function of TOR

Using the representation in the extended binary classification
model, an ordinal loss function that fulfills
the properties of Def. 1 can be obtained by superimposing $K - 1$ binary
loss functions $\ell_{y^k}(\cdot)$, defined for each extended binary class
$y^k \in \{-1, 1\}$ of (1), as follows:

$$\ell_{y_i}(h(x_i), \theta) = \sum_{k=1}^{K-1} \ell_{y_i^k}(g(x_i^k)) \tag{6}$$

where $x_i^k$ is defined in (1) and $g(x_i^k)$ incorporates $\theta_k$. Each binary
loss function, $\ell_{y^k}(\cdot)$, has the following properties:

Definition 3. Binary loss function $\ell_{y^k}(\cdot)$ is defined as follows:

1) $\forall a > 0$: $\ell_1(-a) > \ell_1(a)$,

2) $\forall a$: $\ell_{y^k}(a) = \ell_{-y^k}(-a)$.

In Def. 3, the first property defines the binary loss function
for $y^k = 1$, where a higher loss value is assigned to a misclassified
sample relative to one that has been correctly inferred.
The last property of Def. 3 defines symmetrical positive and
negative class loss functions.
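
The two properties of Def. 3 can be checked numerically for the losses whose formulas are given in Table II. The Python sketch below assumes NumPy and evaluates the properties on a finite grid, so it is a sanity check rather than a proof; the dictionary `LOSSES` and the helper `satisfies_def3` are hypothetical names.

```python
import numpy as np

# Binary losses written as functions of the label yk in {-1, +1} and the output a.
LOSSES = {
    "hinge": lambda yk, a: np.maximum(0.0, 1.0 - yk * a),
    "square_hinge": lambda yk, a: np.maximum(0.0, 1.0 - yk * a) ** 2,
    "logistic": lambda yk, a: np.log1p(np.exp(-yk * a)),
}

def satisfies_def3(loss, grid):
    """Check both properties of Def. 3 on the given grid of outputs."""
    pos = grid[grid > 0]
    # Property 1: for a > 0, a misclassified output costs more than a correct one.
    prop1 = np.all(loss(1, -pos) > loss(1, pos))
    # Property 2: positive and negative class losses mirror each other.
    prop2 = np.all(np.isclose(loss(1, grid), loss(-1, -grid)))
    return bool(prop1 and prop2)
```

A loss that ignores the label, such as the absolute value of the output, fails the first property, which illustrates that Def. 3 is not satisfied by arbitrary functions.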

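Proposition 4. The loss function superimposing $K - 1$ binary
loss functions that fulfill Def. 3 also fulfills Def. 1.

Proof: Let us first prove the first property of Def. 1. We
suppose that $y_i = y_j - 1$, $h(x_i) = h(x_j)$ and $f(x_j) < y_j$.
From (6), to prove $\ell_{y_i}(h(x_i), \theta) < \ell_{y_j}(h(x_j), \theta)$ is the
same as proving $\sum_{k=1}^{K-1} \ell_{y_i^k}(g(x_i^k)) - \sum_{k=1}^{K-1} \ell_{y_j^k}(g(x_j^k)) < 0$.
Assume, to the contrary, that

$$\sum_{k=1}^{K-1} \ell_{y_i^k}(g(x_i^k)) - \sum_{k=1}^{K-1} \ell_{y_j^k}(g(x_j^k)) \ge 0.$$

From (6), we have

$$\begin{aligned}
& \sum_{k=1}^{K-1} \ell_{y_i^k}(g(x_i^k)) - \sum_{k=1}^{K-1} \ell_{y_j^k}(g(x_j^k)) \\
&= \sum_{k=1}^{y_i - 1} \ell_1(g(x_i^k)) + \sum_{k=y_i}^{K-1} \ell_{-1}(g(x_i^k)) - \sum_{k=1}^{y_i} \ell_1(g(x_j^k)) - \sum_{k=y_i+1}^{K-1} \ell_{-1}(g(x_j^k)) \\
&= \sum_{k=1}^{y_i - 1} \ell_1(g(x_i^k)) + \sum_{k=y_i}^{K-1} \ell_{-1}(g(x_i^k)) - \sum_{k=1}^{y_i} \ell_1(g(x_i^k)) - \sum_{k=y_i+1}^{K-1} \ell_{-1}(g(x_i^k)) \quad (\text{since } h(x_i) = h(x_j)) \\
&= -\ell_1(g(x_i^{y_i})) + \ell_{-1}(g(x_i^{y_i})) \\
&= -\ell_1(g(x_i^{y_i})) + \ell_1(-g(x_i^{y_i})).
\end{aligned}$$

The last equality is derived from the second property of
Def. 3. Since $f(x_j) < y_j$ and $y_i = y_j - 1$, and from
(3), we have $\sum_{k=1}^{K-1} I[g(x_j^k) > 0] < y_i$, which implies
$g(x_i^{y_i}) < 0$, or alternatively $-g(x_i^{y_i}) > 0 > g(x_i^{y_i})$. From
the first property of Def. 3, we have $\ell_1(-g(x_i^{y_i}))$ strictly less
than $\ell_1(g(x_i^{y_i}))$. Therefore, $-\ell_1(g(x_i^{y_i})) + \ell_1(-g(x_i^{y_i})) < 0$,
which is a contradiction. In the same manner, the second
property of Def. 1 can be proven to hold. $\blacksquare$

Fig. 4. Loss function $\ell_{y_i}(\cdot)$ using the hinge loss and $K = 5$ with
$\theta_1 = 4$, $\theta_2 = 8$, $\theta_3 = 12$, $\theta_4 = 16$.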

Therefore, a family of binary loss functions fulfilling the
properties in Def. 3, summarized in, but not limited to, Table
II, can be used to minimize the structural risk functional of the
TOR framework in (4). The readers are referred to [?], [?] for
more details on these loss functions.



B. An Instantiation of TOR using Hinge loss 

As mentioned in Section IV-A, our proposed framework can
cater to several commonly used loss functions that satisfy
Def. 3 to minimize the structural risk functional of (4). Here,
we illustrate an instantiation of TOR based on the hinge loss,
since it is commonly used in SVM and satisfies Def. 3. For a
particular labeled sample, $\{x_i, y_i\}$, and using the extended binary
classification model representation with the bias term $b$ included
in the decision function, the extended binary loss function
$\ell_{y_i^k}(\cdot)$ for a particular threshold $\theta_k$ can be derived as:

$$\ell_{y_i^k}(g(x_i^k)) = \max\{0, 1 - y_i^k((w, -\theta)^T x_i^k - b)\} \tag{7}$$

where the augmented sample $x_i^k$ and the weight vector $(w, -\theta)$
are defined in (1) and (2), respectively.

From (7), the ordinal loss function $\ell_{y_i}(\cdot)$ superimposing the
$K - 1$ parts satisfies Def. 1 and becomes:

$$\ell_{y_i}(h(x_i), \theta) = \sum_{k=1}^{K-1} \max\{0, 1 - y_i^k((w, -\theta)^T x_i^k - b)\} \tag{8}$$

as depicted in Fig. 4.
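
The shape of the loss in (8) for the setting quoted in the caption of Fig. 4 ($K = 5$ with thresholds 4, 8, 12 and 16) can be reproduced with a short Python sketch. It assumes NumPy, works directly on the scalar output $h(x) = w^T x$, and sets the bias to zero by default; the names `THETA`, `ordinal_hinge` and `predicted_class` are hypothetical.

```python
import numpy as np

THETA = np.array([4.0, 8.0, 12.0, 16.0])  # thresholds quoted for Fig. 4, K = 5

def ordinal_hinge(h, y, theta=THETA, b=0.0):
    """Eq. (8) for an array of outputs h and one class label y (b assumed 0)."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    k = np.arange(1, len(theta) + 1)
    yk = np.where(k < y, 1.0, -1.0)  # extended labels from (1)
    margins = yk * (h[:, None] - theta[None, :] - b)
    return np.maximum(0.0, 1.0 - margins).sum(axis=1)

def predicted_class(h, theta=THETA):
    """Eq. (3): one plus the number of thresholds exceeded by h."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    return 1 + (h[:, None] > theta[None, :]).sum(axis=1)
```

At $h = 10$, the class 3 loss is zero while the class 2 and class 4 losses are both 3, which matches the two-sided penalty described for Fig. 2. Evaluating the loss on a grid of outputs away from the thresholds also allows the first property of Def. 1 to be checked numerically.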

Let $\Upsilon(h, \theta) = \frac{1}{2}\|w\|^2 + \frac{1}{2}\|\theta\|^2$ (as derived from
(2)) and the ordinal loss function $\ell_{y_i}(\cdot)$ as (8), then considering
the structural risk of labeled data in (4), the extended binary
classification formulation for ordinal regression [?] can be
derived as:

$$\min_{w, b, \theta, \xi} \; \frac{1}{2}\|w\|^2 + \frac{1}{2}\|\theta\|^2 + C_1 \sum_{i=1}^{n} \sum_{k=1}^{K-1} \xi_i^k$$
$$\text{s.t.} \quad y_i^k(w^T \phi(x_i) - \theta_k - b) \ge 1 - \xi_i^k, \quad \xi_i^k \ge 0, \quad \forall i \in \{1, \ldots, n\}, \; k \in \{1, \ldots, K-1\}, \tag{9}$$

where $\phi : \mathbb{R}^p \mapsto \mathcal{F}$ denotes the nonlinear feature mapping
induced by a kernel function, and $w$ is also in $\mathcal{F}$. Thus, the
decision functions in (9) become nonlinear by virtue of the
kernel trick [?]. $\xi_i^k$ denotes the slack variable that caters for
the error committed by $x_i$ at the $k$th decision boundary.

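With transductive learning, the labels of the unlabeled data
in (4) through (9) are then optimized by:

$$\min_{y^*, w, b, \theta, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + \frac{1}{2}\|\theta\|^2 + C_1 \sum_{i=1}^{n} \sum_{k=1}^{K-1} \xi_i^k + C_2 \sum_{j=n+1}^{n+u} \sum_{k=1}^{K-1} \xi_j^{*k}$$
$$\text{s.t.} \quad y_i^k(w^T \phi(x_i) - \theta_k - b) \ge 1 - \xi_i^k, \quad \xi_i^k \ge 0, \quad \forall i \in \{1, \ldots, n\}, \; k \in \{1, \ldots, K-1\},$$
$$\qquad y_j^k(w^T \phi(x_j) - \theta_k - b) \ge 1 - \xi_j^{*k}, \quad \xi_j^{*k} \ge 0, \quad \forall j \in \{n+1, \ldots, n+u\}, \; k \in \{1, \ldots, K-1\}. \tag{10}$$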

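Note that the ordered constraints on the thresholds in (4) are
implicitly fulfilled in (9) and (10) (see the proof in [?]). Recall
that $\{y_{n+1}, y_{n+2}, \ldots, y_{n+u}\}$ is denoted by $y^*$. For a fixed $y^*$,
the dual form of the inner minimization problem in (10) then
becomes:

$$\max_{\alpha} \; \sum_{i=1}^{n+u} \sum_{k=1}^{K-1} \alpha_i^k - \frac{1}{2} \sum_{i=1}^{n+u} \sum_{j=1}^{n+u} \sum_{k=1}^{K-1} \sum_{k'=1}^{K-1} \alpha_i^k \alpha_j^{k'} y_i^k y_j^{k'} K(x_i^k, x_j^{k'})$$
$$\text{s.t.} \quad 0 \le \alpha_i^k \le C_1, \quad \forall i \in \{1, \ldots, n\}, \; k \in \{1, \ldots, K-1\},$$
$$\qquad 0 \le \alpha_j^k \le C_2, \quad \forall j \in \{n+1, \ldots, n+u\}, \; k \in \{1, \ldots, K-1\}, \tag{11}$$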
where $K(x_i^k, x_j^{k'}) = \phi(x_i)^T \phi(x_j) + e_k^T e_{k'}$ is the resultant
kernel evaluation of $x_i^k$ and $x_j^{k'}$, and $\alpha_i^k$ is the Lagrangian
multiplier for the inequality constraint in (10). Note that this dual
is in the form of a quadratic programming (QP) problem, and
thus can be easily solved using standard SVM solvers.

In Algorithm 1, one can use (10) to solve (4) while fixing
$y^*$ and then apply the swapping scheme (5) to update $y^*$. The
entire process is then repeated until convergence is reached.

C. Discussion of label swapping for K ordinal class problem 

Proposition 2 for TOR is a generalization to $K$
ordinal class problems; hence the proposition also applies to
the binary class problems described in [?]. However, the
TSVM in [?] cannot handle ordinal classification problems
elegantly. For example, a sample $\{x, y = 3\}$ in a 5 class
problem can be augmented to form binary data using (1)
as $\{(x, e_1), 1\}$, $\{(x, e_2), 1\}$, $\{(x, e_3), -1\}$, $\{(x, e_4), -1\}$.
However, swapping with another data vector may cause the
dataset to violate the ordinal properties defined in (1) (e.g.,
$\{(x, e_1), -1\}$, $\{(x, e_2), 1\}$, $\{(x, e_3), -1\}$, $\{(x, e_4), -1\}$). In
contrast, Proposition 4 proves that TOR addresses this
elegantly by generalizing the ordinal loss function to include
commonly used binary loss functions.
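
The violation in this example can be made concrete with a small check on the $K - 1$ extended binary labels of one sample. The Python sketch below is only an illustration; the helper names `is_valid_ordinal_encoding` and `decode` are hypothetical.

```python
def is_valid_ordinal_encoding(labels):
    """True if the K-1 binary labels are a run of +1 followed by a run of -1."""
    seen_negative = False
    for v in labels:
        if v == -1:
            seen_negative = True
        elif seen_negative:  # a +1 after a -1 breaks the ordering in (1)
            return False
    return True

def decode(labels):
    """Recover the ordinal class of a valid encoding: 1 + number of +1 labels."""
    return 1 + sum(1 for v in labels if v == 1)
```

The encoding 1, 1, -1, -1 is valid and decodes to class 3, whereas -1, 1, -1, -1 corresponds to no ordinal class, which is the situation that an independent swap of binary labels can create.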

V. Experiments 

In this section, we investigate the efficacy of several
state-of-the-art ordinal regression algorithms and the proposed
transductive ordinal regression, which are described in Table I, on
a set of benchmark datasets and the task of sentiment prediction.
Since existing ordinal regression models can deal with
labeled data only, comparisons to three state-of-the-art ordinal
regression algorithms trained with labeled data are also considered in
the present study (namely RED-SVM^3 using (9), SVOR-EXC^4
and SVOR-IMC^4).

To investigate the effect of the cluster assumption on the
unlabeled data, comparison to the multi-class transductive
SVM (M-TSVM) [?] is also considered by using a multi-class
training paradigm. In the experimental study, the M-TSVM
is trained using both labeled and unlabeled data based
on a one-versus-rest approach. Since the performance of M-TSVM
is very sensitive to the balance constraints on the
labels of the unlabeled data, a strategy similar to that proposed
in Section III-A, i.e., taking the class ratio, $ratio_k$, from
the labeled data as the balance constraint imposed on the
labels of the unlabeled data, is also considered for M-TSVM.
Taking the $k$th class for example, the constraint enforces the
proportion of Class $k$ to the rest of the unlabeled data as
$ratio_k : 1 - ratio_k$. With the inclusion of M-TSVM, the
impact of ordinal knowledge on the performance metrics can
be analyzed.

A. Experimental Setup 

For each data set, the labeled data are randomly sampled at 
different sizes (100, 150, 200, 250, 300, 350 and 400). Let $s$ 
denote the sample size of each dataset described in Tables III 
and IV; $s - 400$ samples then form the set of unlabeled data. 

http://www.work.caltech.edu/~htlin/program/libsvm/#ordinal 
http://www.gatsby.ucl.ac.uk/~chuwei/svor.htm 

The cost parameter $C_1$ of each algorithm is determined 
using a five-fold cross-validation procedure with 
$\log_{10} C_1 \in \{-3, -2, -1, 0, 1, 2, 3, 4, 5\}$. To report 
statistically significant results on the unlabeled data, the 
average test performances of 20 independent realizations are 
presented. 
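
The parameter selection described above can be sketched as a plain grid search. This is a minimal sketch, not the authors' code: `cv_error` is a hypothetical callback that trains on the training indices with cost `c` and returns the validation error, and the fold construction and tie-breaking rule (the smaller cost wins) are assumptions.

```python
import math
import random


def five_fold_indices(n, seed=0):
    """Shuffle the indices 0..n-1 and deal them into five folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::5] for f in range(5)]


def select_cost(n, cv_error, exponents=range(-3, 6)):
    """Return the cost 10**e, e in {-3, ..., 5}, with the lowest mean CV error."""
    folds = five_fold_indices(n)
    best_c, best_err = None, float("inf")
    for e in exponents:
        c = 10.0 ** e
        fold_errors = []
        for f in range(5):
            val_idx = folds[f]
            train_idx = [i for g in range(5) if g != f for i in folds[g]]
            fold_errors.append(cv_error(c, train_idx, val_idx))
        mean_err = sum(fold_errors) / 5
        if mean_err < best_err:  # strict comparison keeps the smaller cost on ties
            best_c, best_err = c, mean_err
    return best_c


def toy_cv_error(c, train_idx, val_idx):
    """Stand-in error surface with its minimum at log10(c) = 2 (illustration only)."""
    return abs(math.log10(c) - 2.0)


best = select_cost(n=100, cv_error=toy_cv_error)
```

In a real experiment, `toy_cv_error` would be replaced by a routine that fits the model under study on `train_idx` and evaluates it on `val_idx`.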

To measure the classification error of the samples, the mean 
zero-one error is employed as the performance metric and is 
defined as: 

$$\frac{1}{u} \sum_{j=n+1}^{n+u} I[y_j^* \neq y_j],$$ 

where $I[\cdot]$ denotes an indicator function that returns 1 if 
the predicate holds and 0 otherwise, and $y_j^*$ and $y_j$ are 
the predicted label of the respective algorithm and the true 
class label, respectively. 

To measure how far the predicted class labels of the samples 
differ from their true class labels, the mean absolute error is 
employed as the performance metric, which is defined as: 

$$\frac{1}{u} \sum_{j=n+1}^{n+u} |y_j^* - y_j|,$$ 

where $|\cdot|$ denotes the absolute value. 
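
For concreteness, the two metrics described above can be computed as follows. This is a minimal sketch that assumes the predictions and true labels of the evaluated samples are given as equal-length lists of integer ranks, and that both metrics are averaged over those samples; the function names are illustrative.

```python
def mean_zero_one_error(y_pred, y_true):
    """Fraction of samples whose predicted rank differs from the true rank."""
    if len(y_pred) != len(y_true) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p != t for p, t in zip(y_pred, y_true)) / len(y_true)


def mean_absolute_error(y_pred, y_true):
    """Average absolute distance between predicted and true ranks."""
    if len(y_pred) != len(y_true) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


# Two of five predictions are wrong: one is off by one rank, one by two ranks.
y_true = [1, 2, 3, 4, 5]
y_pred = [1, 3, 3, 2, 5]
zero_one = mean_zero_one_error(y_pred, y_true)
mae = mean_absolute_error(y_pred, y_true)
```

The example shows why the two metrics can disagree: the zero-one error counts both mistakes equally, whereas the absolute error penalizes the prediction that is two ranks away twice as much.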
B. Benchmark data sets 



TABLE III 

Benchmark datasets for ordinal regression 

Dataset       Sample Size   # Features 
Abalone       4,177         8 
Bank          8,192         32 
California    20,640        8 
Census        22,784        16 



Four commonly used benchmark datasets (Abalone, Bank, 
California and Census) in ordinal regression problems are 
considered in the present study. The statistics of these 
benchmark datasets are summarized in Table III. These datasets 
were preprocessed with a quantization level of $K = 5$. For all 
algorithms, we considered the perceptron kernel [?], which 
is defined as follows: 

$$k(x, x') = \Delta_p - \|x - x'\|_2,$$ 

where $\Delta_p$ denotes a constant. As discussed in [?], the 
perceptron kernel can be used by an SVM to construct an infinite 
ensemble of classifiers over perceptrons. In other words, the 
resultant SVM classifier using the perceptron kernel is 
equivalent to a neural network with one hidden layer containing 
infinitely many hidden neurons. Moreover, based on the 
Karush-Kuhn-Tucker (KKT) conditions, the label-weighted sum of 
the dual variables is zero, as derived from the dual formulation, 
so the term $\Delta_p$ can be set to zero without changing the 
objective value of the dual SVM formulation [?]. As such, here we 
consider the simplified perceptron kernel with $\Delta_p = 0$ in 
the experimental study. 

http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html 
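
The claim that the constant in the perceptron kernel does not affect the dual objective can be checked numerically. The sketch below uses a generic binary dual SVM notation (dual variables `alpha`, labels `y` in {+1, -1}), which is an assumption made for illustration rather than the paper's exact formulation. Since the constant contributes $\Delta_p (\sum_i \alpha_i y_i)^2$ to the quadratic term, it vanishes whenever $\sum_i \alpha_i y_i = 0$.

```python
import math


def perceptron_kernel(x, z, delta_p=0.0):
    """Perceptron kernel: a constant minus the Euclidean distance."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
    return delta_p - dist


def dual_quadratic_term(alpha, y, X, delta_p):
    """sum_i sum_j alpha_i alpha_j y_i y_j k(x_i, x_j), the kernel-dependent part."""
    n = len(X)
    return sum(
        alpha[i] * alpha[j] * y[i] * y[j] * perceptron_kernel(X[i], X[j], delta_p)
        for i in range(n)
        for j in range(n)
    )


# Toy data; alpha is chosen so that sum_i alpha_i * y_i = 0.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
y = [1, -1, 1, -1]
alpha = [0.5, 0.2, 0.1, 0.4]

q_zero = dual_quadratic_term(alpha, y, X, delta_p=0.0)
q_shift = dual_quadratic_term(alpha, y, X, delta_p=7.5)
```

Up to floating-point rounding, `q_zero` and `q_shift` coincide, which is why the simplified kernel with the constant set to zero can be used.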



TABLE IV 

Datasets in the sentiment prediction 

Dataset              Sample Size   # Features 
Book                 5,501         17,862 
DVDs                 5,118         19,059 
Electronics          5,901         10,728 
Kitchen Appliances   5,149         9,230 



C. Synthetic data set 



A synthetic data set with various degrees of cluster assumption 
is created using the generator described in Algorithm 3 to study 
the performance of transductive TOR versus non-transductive 
RED-SVM. 



Algorithm 3 Synthetic Data Set Generator 

 1: Inputs: $y \in [1, \ldots, K]$, where $K$ is the number of ordinal 
    classes; $p$ is a parameter to control the strength of cluster 
    assumption 
 2: for int $d = 1$; $d < 2000(K + 2)$; $d$++ do 
 3:   if $d \in [2000(y - 1), 2000(y + 2)]$ then 
 4:     if rand() < 0.01 then 
 5:       $x^d$ = rand() 
 6:     else 
 7:       $x^d$ = 0 
 8:     end if 
 9:   else 
10:     if rand() < 0.01p then 
11:       $x^d$ = rand() 
12:     else 
13:       $x^d$ = 0 
14:     end if 
15:   end if 
16: end for 
17: return x 
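
A possible Python rendering of the generator in Algorithm 3 is given below as a sketch. The `block` argument (2000 in the pseudocode) and the `rng` argument are assumptions added so that small, reproducible examples can be drawn; the loop bound mirrors the strict inequality in line 2, and feature $d$ is stored at list position `d - 1`.

```python
import random


def generate_sample(y, K, p, block=2000, rng=None):
    """Draw one sparse feature vector for ordinal class y in {1, ..., K}.

    Features with index d in [block*(y-1), block*(y+2)] belong to the class
    and are non-zero with probability 0.01; all other features are non-zero
    with probability 0.01*p, so a larger p weakens the cluster assumption.
    """
    rng = rng or random.Random()
    x = []
    for d in range(1, block * (K + 2)):  # d = 1, ..., block*(K+2) - 1
        in_class_block = block * (y - 1) <= d <= block * (y + 2)
        prob = 0.01 if in_class_block else 0.01 * p
        # The condition is evaluated first, then a fresh uniform value is drawn.
        x.append(rng.random() if rng.random() < prob else 0.0)
    return x


# Small example: with p = 0, every feature outside the class block is zero.
sample = generate_sample(y=2, K=5, p=0.0, block=100, rng=random.Random(0))
```

With `block=100`, `y=2` and `K=5`, the class block covers features 100 to 400 out of 699 features in total.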



Recall that the cluster assumption holds when each class is 
more separable by a particular set of features; hence line 3 
in Algorithm 3 defines the set of features $S_y$ belonging to 
a particular class $y$. Specifically, a rand() function is used 
to generate a number $x^d$, which is randomly drawn from a 
uniform distribution in the interval between 0 and 1. To simulate 
sparse input vectors, each feature $x^d \in S_y$ is set to 
$x^d = \mathrm{rand}()$ with probability 0.01; otherwise, 
$x^d = 0$. To define the degree of cluster assumption on a feature 
$x^d \notin S_y$, we introduce the parameter $p$ and assign the 
feature a random value with probability $0.01p$; otherwise, 
$x^d = 0$. Note that a higher $p$ value leads to greater overlap 
among classes, and thus a lower degree of cluster assumption. In 
the experiment, we consider $p = (0.0, 0.1, \ldots, 0.9)$ and 
$K = 5$. We randomly generate 20 sets of 2500 examples, and use 
200 examples as the labeled data and the remaining examples as 
unlabeled data. In addition, the 
Fig. 5. Mean Zero-One Error on benchmark datasets 



D. Sentiment data sets 

The task of sentiment prediction is to predict the star rating 
of each review. The datasets for sentiment prediction, as 
defined in [?], were generated from Amazon.com and comprise 
four categories of product reviews: Book, DVDs, Electronics 
and Kitchen appliances. The reviews carry five ordinal 
rating labels ranging from 1 to 5. A higher rating means 
better review feedback. The details pertaining to the sample 
and feature sizes of the sentiment datasets are summarized in 
Table IV. 

In the experimental study, we further preprocessed the 
datasets by removing all stop-words, normalizing each feature 
and performing stemming. Finally, each feature of a review is 
represented by its respective tf-idf value. The inner product of 
two reviews is defined using the cosine similarity, with a linear 
kernel used in the experiments. 
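
The representation described above can be sketched in a few lines of Python. The exact tf-idf variant is not stated in the text, so this sketch assumes raw term counts multiplied by $\log(N / \mathrm{df})$; stop-word removal and stemming are omitted, and the reviews are assumed to be already tokenized.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf-idf weight} dict."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append(
            {t: c * math.log(n / doc_freq[t]) for t, c in term_freq.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Inner product of two sparse vectors after L2 normalization."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Three toy "reviews", already tokenized.
docs = [["great", "book", "great", "plot"], ["great", "film"], ["poor", "plot"]]
vecs = tfidf_vectors(docs)
sim_shared = cosine_similarity(vecs[0], vecs[1])    # share the term "great"
sim_disjoint = cosine_similarity(vecs[1], vecs[2])  # share no terms
```

Because the vectors are normalized inside `cosine_similarity`, the resulting Gram matrix is the linear kernel on unit-length tf-idf vectors.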



The perceptron kernel was reported to offer results competitive with the 
Gaussian kernel [?], but a benefit of the perceptron kernel lies in its higher 
computational efficiency, which has been shown to be more than 10 times 
faster than the Gaussian kernel. Furthermore, the perceptron kernel does not 
have any additional kernel parameter to be configured. In some previous 
studies on ordinal regression problems [?], [?], the perceptron kernel was also 
reported to attain higher accuracies than the Gaussian kernel. 
www.cs.jhu.edu/~mdredze/datasets/sentiment/ 
Fig. 6. Mean Absolute Error on benchmark datasets 

VI. Discussions on Experimental Results 
A. Results on Benchmark and synthetic Datasets 

On the benchmark and synthetic datasets, we performed 
experiments for $K = 5$ to assess the predictive performance of 
various state-of-the-art algorithms. The experimental results on 
the benchmark and synthetic datasets are discussed in Sections 
VI-A1 and VI-A2, respectively. 

1) Mean Zero-One and Absolute Errors on Benchmark 
Dataset: The results of mean zero-one error for each benchmark 
dataset are summarized in Fig. 5. As observed from the figures, 
both SVOR-IMC and SVOR-EXC exhibit similar results on all the 
datasets considered. RED-SVM, on the other hand, shows 
significantly improved performance over SVOR-IMC and SVOR-EXC 
on all the datasets, which is in line with the results obtained 
in [?]. Notably, the proposed transductive ordinal regression 
algorithm, TOR, exhibits the best performance across all 
experiments. As shown in Fig. 5, TOR reports improvements of at 
least 2% and up to 6% relative to SVOR-IMC and SVOR-EXC. 

As discussed in [?], data in high dimensional feature spaces, 
such as text documents and sentiment data, usually follow the 
cluster assumption. From Tables III and IV, the numbers of 
features of the Bank, Census and Sentiment data sets are higher 
than those of the Abalone and California data sets. From the 
results reported in Fig. 5, we observe that the improvements of 
TOR over RED-SVM are larger on the Bank and Census datasets. This 
is possibly because Bank and Census have higher feature 
dimensions, so these datasets satisfy the cluster assumption 
better. 

Regarding transductive learning, M-TSVM displays the worst 
performance in most of the experiments relative to the other 
algorithms considered, especially on the California and Census 
datasets in Fig. 5. This is unsurprising since M-TSVM is designed 
to deal with multi-class problems and does not make use of the 
ordinal information available in the data. Without the use of 
ordinal knowledge, transduction to infer the correct labels of 
unlabeled data becomes ever more challenging. 

Fig. 7. Analysis of RED-SVM using the class distribution of the labeled 
data for classification (i.e., the label initialization phase of TOR), on the 
dataset with various strengths of cluster assumption. SubFigs. (a) and (b) 
depict the mean zero-one and mean absolute errors, respectively. A higher p 
value weakens the cluster assumption. 

Fig. 8. Analysis of TOR, on the dataset with various strengths of cluster 
assumption, after the label initialization phase (i.e., RED-SVM using the class 
distribution of the labeled data for classification). SubFigs. (a) and (b) depict 
the differences (improvements) in mean zero-one and mean absolute errors, 
respectively, between TOR at convergence in Algorithm 1 and TOR after 
label initialization. A higher p value weakens the cluster assumption. 

Next, we analyze the mean absolute errors on the benchmark 
regression datasets depicted in Fig. 6. The results indicate that 
M-TSVM, which does not impose any ordinal constraints, performed 
badly on all the datasets, as observed in the subfigures. 
On the other hand, algorithms that use the ordinal information 
are noted to attain competitive mean absolute errors. While 
emerging as superior in mean zero-one error, TOR did not come 
out on top in terms of mean absolute error. We hypothesize that 
this is because the datasets contain continuous response 
variables, i.e., they are regression problems that have been 
manually quantized into 5 ranks. In Section VI-A2, we will 
validate our hypothesis on a synthetic dataset. 

2) Mean Zero-One and Absolute Errors on a Synthetic 
Ordinal Regression Dataset: Here, we analyze the label 
swapping procedure of the transductive approach, i.e., TOR, 
after the non-transductive approach, i.e., RED-SVM, using the 
class distribution of the labeled data for classification (the 
label initialization phase of TOR). The results summarized in 
Fig. 7 indicate that the mean zero-one and absolute errors 
of the non-transductive RED-SVM approach deteriorate with 
decreasing degrees of cluster assumption (i.e., configured via 
increasing parameter $p$). Similarly, the proposed transductive 
approach, i.e., TOR, which leverages the cluster assumption of 
the unlabeled data, exhibits smaller improvements in mean 
zero-one and absolute errors when the degree of cluster 
assumption decreases (i.e., $p > 0.2$), as depicted in Fig. 8. At 
the other extreme, when the cluster assumption holds strongly 
(i.e., $p = 0$), the improvements in both mean zero-one and 
absolute errors are observed to be smaller than those for 
$p = 0.1$. This can be explained by the decision boundaries of 
RED-SVM already lying in the low density regions of the labeled 
and unlabeled data when the cluster assumption holds strongly. 
Finally, when the cluster assumption does not hold (i.e., 
$p > 0.6$), both transductive and non-transductive approaches 
fail. 

Fig. 9. Mean Zero-One Error for different $C_1$ values 

Later, in Section VI-B, our experimental study shows that 
TOR attains significantly larger improvements over RED-SVM 
in both mean zero-one and absolute errors on the real 
world sentiment datasets than on the benchmark datasets. The 
reason is that, similar to the synthetic data, the real world 
sentiment datasets are composed of samples that lie in a 
sparse high dimensional feature space, so they satisfy 
the cluster assumption more closely than the benchmark 
datasets, which contain continuous response variables 
that have been artificially quantized to form the ordinal labels. 

3) Sensitivity of $C_1$ Parameter: In this subsection, we 
investigate the sensitivity of the RED-SVM and TOR methods 
to different $C_1$ configurations, particularly in the discrete 
steps of $\log_{10} C_1 \in \{-3, -2, -1, 0, 1, 2, 3\}$. We 
performed the experiments for $K = 5$ with 400 labeled 
data. The results depicted in Figs. 9(a) and 9(b) for the 
Bank and Census datasets, respectively, denote the average 
test performances of 20 independent realizations. TOR is 
observed to achieve improved performance in all the settings 
considered, and exhibits a more stable mean zero-one error than 
RED-SVM across the range of $C_1$ values. The performance of 
RED-SVM, on the other hand, is noted to be highly sensitive 
to changes in $C_1$. The robustness of TOR can be 
attributed to learning from a fusion of the labeled data and the 
density distribution estimated from the unlabeled data when 
maximizing the margin of separation. 

B. Results on Real World Sentiment Datasets 

Here, we apply the proposed TOR to a real world application, 
namely sentiment ordinal classification datasets. Since SVOR-EXC 
and SVOR-IMC are not designed to handle datasets with high 
dimensional inputs such as the sentiment datasets, these two 
algorithms are omitted from the experimental study. The results 
obtained with the remaining algorithms are summarized in Fig. 10. 

Fig. 10. Mean Zero-One Error on Sentiment datasets. Error bars denote the 
standard deviation 

Notably, TOR displays superior performance over RED-SVM, 
with at least 8% and up to 12% improvements in 
accuracy. Furthermore, even when TOR employs only a small 
number of 100 labeled data samples, complemented by the 
unlabeled data, a significantly lower error relative to 
RED-SVM can be observed, despite the latter using a larger 
set of 400 labeled data samples. This observation demonstrates 
the effectiveness of using unlabeled data in ordinal regression. 

The mean absolute error metric defined in Section V-A is also 
reported for the sentiment datasets, as summarized in Fig. 11. 
It is worth noting that a mean absolute error larger than one 
indicates that the predicted rating differs from the true label 
by more than one rating scale on average. For example, RED-SVM 
with a mean absolute error close to one on 100 labeled data 
indicates that the predicted labels differ from their respective 
true class labels by about one unit on average. On the other 
hand, TOR is observed in Fig. 11 to exhibit a significantly lower 
mean absolute error than RED-SVM, thus suggesting that the 
predictions made by TOR are closer to the true labels on most 
data samples. Overall, TOR reports a significantly lower mean 
absolute error than M-TSVM on all the datasets considered. 

Another interesting observation from Fig. 11 pertains to the 
case of limited labeled data. In particular, M-TSVM is shown to 
deliver a lower mean absolute error than RED-SVM when labeled 
data are limited, which is made possible by complementing the 
learning process with the abundance of unlabeled data. As the 
number of available labeled data increases, the ordinal 
information learned by RED-SVM generally helps to lower the mean 
absolute errors, as observed in Fig. 11. In contrast, TOR 
benefits from learning from both the ordinal knowledge 
and the density information of the unlabeled data, which yields 
the improvements in mean absolute error observed over RED-SVM 
and M-TSVM. 

Fig. 11. Mean Absolute Error on Sentiment datasets. Error bars denote the 
standard deviation 

In Figs. 10 and 11, the error bars representing the standard 
deviation are also presented.* As observed, the standard 
deviations obtained by the transductive algorithms, i.e., 
M-TSVM and TOR, are generally smaller than that of the inductive 
RED-SVM algorithm, which suggests the robustness of 
the transductive learning paradigm. 

Next, we analyze the label swapping procedure of TOR 
in detail by increasing the number of labeled data to be used up 
to 2000. Fig. 12 depicts the effectiveness of label swapping after 
the label initialization. Label swapping effectively reduces the 
mean zero-one and absolute errors, as shown in Figs. 12(a) and 
12(b), respectively, and the improvements achieved by TOR 
decrease as the number of labeled data increases. Another 
observation is that as the number of labeled data 
increases, the number of SVM training iterations within TOR 
generally decreases, as shown in Fig. 12(c). This is expected 
since, as more labeled data are added into the training set, 
the decision boundaries become less affected by the unlabeled 
data. Therefore, TOR is deemed more effective when only 
a small number of labeled data is available. 

*For the other figures on benchmark datasets, too many comparison 
algorithms are depicted. Hence, the error bars are not provided. 

Fig. 12. Analysis of TOR after the label initialization phase. SubFigs. (a) 
and (b) depict the differences (improvements) in mean zero-one and mean 
absolute errors, respectively, between TOR at convergence in Algorithm 1 
and TOR after label initialization. SubFig. (c) depicts the number of SVM 
trainings for TOR to reach convergence. 



Fig. 12(c) depicts the number of iterations for TOR 
to converge. Let $T$ be the number of iterations for TOR to 
converge. The computational cost of TOR is then $O(TR)$, 
where $R$ is the computational cost of RED-SVM. However, 
it is notable that the training process of TOR can be 
accelerated via a warm-start strategy, i.e., using the previous 
solution of the alpha variables as the initial alpha variables 
for the next iteration. 
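
The cost argument and the warm-start idea above can be illustrated with a schematic outer loop. This is only a control-flow sketch under stated assumptions: `train_svm` and `swap_labels` are hypothetical callbacks standing in for the RED-SVM solver and the label swapping scheme, and the toy stand-ins below exist purely to exercise the loop.

```python
def run_outer_loop(train_svm, swap_labels, labels, max_iter=100):
    """Alternate SVM training and label swapping until the labels stop changing.

    train_svm(labels, init_alpha) and swap_labels(labels, alpha) are
    hypothetical callbacks, not functions of any real library.
    """
    alpha = None  # no previous solution exists before the first training
    iterations = 0
    while iterations < max_iter:
        # Warm start: the previous alpha seeds the next training run.
        alpha = train_svm(labels, init_alpha=alpha)
        iterations += 1
        new_labels = swap_labels(labels, alpha)
        if new_labels == labels:  # no swap was made, so the loop has converged
            break
        labels = new_labels
    return labels, alpha, iterations


# Toy stand-ins, only to exercise the control flow.
seen_inits = []


def toy_train(labels, init_alpha):
    seen_inits.append(init_alpha)
    return len(seen_inits)  # pretend that "alpha" is simply the call count


def toy_swap(labels, alpha):
    if 0 in labels:
        i = labels.index(0)
        return labels[:i] + [1] + labels[i + 1:]
    return labels


final_labels, final_alpha, T = run_outer_loop(toy_train, toy_swap, [0, 0, 1])
```

Each pass through the loop costs one solver call, so $T$ passes cost roughly $T$ times the cost of a single training run; warm starting aims to make each later call cheaper than a cold start.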



VII. Conclusion 

In this paper, by taking advantage of the abundance of 
unlabeled patterns, we have presented a novel transductive 
learning paradigm for ordinal regression, namely Transductive 
Ordinal Regression (TOR). To the best of our knowledge, the 
present work serves as the first attempt to address the 
general ordinal regression problem in a transductive setting 
for a family of ordinal loss functions. This family of ordinal 
loss functions, including the hinge loss, logistic loss and 
Laplacian loss, is supported. A label swapping scheme is also 
introduced to guarantee a strictly monotonic decrease in the 
objective value of the transductive ordinal function. Based 
on the experimental results obtained, TOR attains significant 
accuracy improvements over all the other algorithms considered 
by leveraging the cluster assumption on the unlabeled data and 
the ordinal constraints imposed to maximize the margin of 
separation between consecutive classes in ordinal regression. 
In situations where only a few labeled data are available, TOR 
clearly serves as an indispensable tool. 



